Suppose you have a function f(x) and you want to find the value of x that makes f(x) as large as possible. This is called a maximum.
Our issue is thus to find various forms of maxima and minima. We denote this problem as follows:
Number
Objective Name
Mathematical expression
Key elements
(1)
Unconstrained Maximization
\max_x f(x)
The only restriction on f is the nature of its admissible inputs \mathbf{dom}(f)
(2)
Finding the global maximizer
x^\star \gets \argmax_x f(x)
There is generally a more restriction set of assumptions than in (1) to find the precise x^\star for which f is minimized
(3)
Finding the a maximiser within a constraint set
\underset{x\in\mathcal{X}}{\max f(x)}
In this case, there is no possibility for x to go outside of a predefined set \mathcal{X}, this is very useful in real-life applications
When is calculus useful and not useful ? Here’s a first reason it might or might not be useful: the domain\mathcal{X} of the input variable, or (which is often the same), the domain \mathbf{dom}(f) or our function.
A pretty important part of the problem is what we call the domain of the function. This is the set of values that x can take on. For example, if f(x) = x^2, then the domain is the set of real numbers \mathbb{R}.
Domains change everything because, for example, if your domain \mathbf{dom}(X) is discrete, then the image \mathbf{im}_f(X) (also denoted f(\mathbf{dom}(X))) is also discrete, and you just end up with the array-sorting problem, for which efficient algorithms like quicksort exist..
The derivative of a function f(x)\colon \mathbb{R}\to\mathbb{R} is a function f'(x) is just what everyone knows, that is the limit of
f'(x) = \lim_{h\to 0} \frac{f(x+h) - f(x)}{h}
When you have a function of several variables, say f(x_1, x_2, \dots, x_n), whose values are still real numbers, then you can still behave as if nothing happened, because if you fix the n-1 variables
Then the univariate functiong(x_i) is still a function of a single variable, and you can still take its derivative:
g\colon x \mapsto f(\bar{x}_{\scriptscriptstyle-i}, x) \coloneqq f(\bar{x}_1,\ldots,\bar{x}_{\scriptscriptstyle i-1},x_i, x_{\scriptscriptstyle i+1},\dots\bar{x}_n)
We have a bunch of fancy notation for this, namely \partial_i f(\bar{x}) for short, or \frac{\partial f}{\partial x_i}(\bar{x}). We also sometimes speak of the directional derivative of f in the direction of \bar{x}, which is often denoted D_{\bar{x}}f.
What is not clear though, is how we relate each derivative to the others. How does \partial_i f(\bar{x}) relate to \partial_j f(\bar{x})?
There is a good theorem to explain this !
1.1.2 The Jacobian
The theorem that defines and cements the derivatives is the following:
Let f\colon \mathbb{R}^n \to \mathbb{R}^m be a function. Then, at each point x where f is differentiable, there is a unique matrixM(x) so that: f(x+h) \approx f(x) + M(x)h +\varepsilon (h) where \varepsilon (h) is a small vector in \mathbb{R}^m that goes to 0 faster than h does1.
1.1.3 The Proof
The uniqueness is fairly easy to prove, because if there are two matrices M_1 and M_2 such that f(x+h) \approx f(x) + M_1h and f(x+h) \approx f(x) + M_2h, then we can subtract the two equations, (M_1 - M_2)h = \varepsilon_1 (h) + \varepsilon_2 (h)
divide by \lvert{h}\rvert to get that the LHS tends to M_1 - M_2 as h\to 0, and the RHS tends to 0 as h\to 0, so we get that M_1 - M_2 = 0.
The existence is essentially the notion that we take matrices as rectangular arrays of numbers. In that sense, if the matrix M(x) has coefficients \left(\smash{\alpha_{ij}(x)}\right)_{i,j}, then we can write, our equation is merely a statement that is 1-D (which we know how to deal with):
This last set of equation is just a m\times n grid of first order approximations for each f_j at each x_i, and we can solve for the \alpha_{ij} by just taking the limit as h_i\to 0 for each i2.
If you hadn’t guessed before, each of these equations are looking very much like \varphi(x+h) \approx \varphi(x) + \varphi'(x)h + \varepsilon(h)
we know in 1D, and we can just take the limit as h\to 0 to get that \varphi'(x) = \alpha_{ij}(x).
What this proves is 3 things:
The Jacobian is a matrix, and it is unique.
The Jacobian is a linear operator, and it is continuous.
The Jacobian matrix is populated by the partial derivatives of f, that is
The jacobian sends points from a n-dimensional space to a m-dimensional space, so the Jacobian matrix is a m\times n matrix. This is why the partial derivatives are written in the order \dfrac{\partial f_j}{\partial x_i}, because the jth row of the matrix is the partial derivatives of f_j with respect to the x_i’s.
This is very counterintuitive to anyone who works with the gradient \nabla f(x), which is a n-dimensional vector, and is the transpose of the Jacobian matrix.
1.1.4 A small example using python
Let’s take a look at a small example I made using python’s sympy library. Through graphing increasingly complex 2D functions on an online tool, I found a function that is complex enough for us to have fun, but not too complex that we can’t understand it.
with A(x,y) = x^{\frac{1}{3}} - y^{\frac{2}{3}} and B(x,y) = \sin \left(x\right)\cdot \cos \left(x\right).
We can plot this function using plotly:
Code
import plotly.graph_objects as goimport requests as rqimport jsonfrom pathlib import Pathtemplate = json.loads( Path("../plotlyTemplate.json").read_text())data = rq.get("https://storage.googleapis.com/open.data.arnov.dev/static/plots/optimization/CostFunction2D.json")fig = go.FigureWidget(data=data.json())fig.update_layout(**template,)fig.show()
A 3D plot of the function G(x,y) = \log\left(1 + \dfrac{A(x,y)^2}{\gamma + B(x,y)^2}\right)
in the sense of a norm: \lvert x\rvert, \varepsilon(h)/\lvert x\rvert\to 0 as \lvert x\rvert\to 0.↩︎
There is a subtelty here, because we can’t really just let any h_i to zero before the others arbitrarily, but if we do it in any order according to most used norms (here we’ll use the euclidean norm \lvert \cdot \rvert), then we get that the limit exists and is unique.)↩︎
Source Code
---include-in-header: | \newcommand\dom{\mathop{\mathrm{dom}}\limits} \newcommand\im{\mathop{\mathrm{im}}\limits} \newcommand\supp{\mathop{\mathrm{supp}}\limits} \newcommand\id{\mathop{\mathrm{id}}\limits} \newcommand\tr{\mathop{\mathrm{tr}}\limits} \newcommand\diag{\mathop{\mathrm{diag}}\limits} \newcommand\sign{\mathop{\mathrm{sign}}\limits} \newcommand\sgn{\mathop{\mathrm{sgn}}\limits} \newcommand\dist{\mathop{\mathrm{dist}}\limits} \newcommand\argmin{\mathop{\mathrm{argmin}}\limits} \newcommand\argmax{\mathop{\mathrm{argmax}}\limits} \newcommand\minimax[2]{\mathop{\mathrm{min}_{#1}\mathrm{max}_{#2}}\limits} \newcommand\maximin[2]{\mathop{\mathrm{max}_{#1}\mathrm{min}_{#2}}\limits}---# Review of Optimization basics {.Review}## First definitions {.FirstDefs}### Why we need CalculusSuppose you have a function $f(x)$ and you want to find the value of $x$ that makes $f(x)$ as large as possible. This is called a *maximum*. Our issue is thus to find various forms of maxima and minima. We denote this problem as follows:| Number | Objective Name | Mathematical expression | Key elements || --- | --- | ----- | ------ || $(1)$ | Unconstrained Maximization | $\max_x f(x)$ | The only restriction on $f$ is the nature of its admissible inputs $\mathbf{dom}(f)$| $(2)$ | Finding the global maximizer | $x^\star \gets \argmax_x f(x)$ | There is generally a more restriction set of assumptions than in $(1)$ to find the precise $x^\star$ for which $f$ is minimized || $(3)$ | Finding the a maximiser within a constraint set | $\underset{x\in\mathcal{X}}{\max f(x)}$ | In this case, there is no possibility for $x$ to go outside of a predefined set $\mathcal{X}$, this is very useful in real-life applications |When is calculus useful and not useful ? Here's a first reason it might or might not be useful: the __domain__ $\mathcal{X}$ of the input variable, or (which is often the same), the domain $\mathbf{dom}(f)$ or our function.A pretty important part of the problem is what we call the *domain* of the function. This is the set of values that $x$ can take on. For example, if $f(x) = x^2$, then the domain is the set of real numbers $\mathbb{R}$. Domains change everything because, for example, if your domain $\mathbf{dom}(X)$ is discrete, then the image $\mathbf{im}_f(X)$ (also denoted $f(\mathbf{dom}(X))$) is also discrete, and you just end up with the array-sorting problem, [for which efficient algorithms like quicksort exist.](https://en.wikipedia.org/wiki/Quicksort).**The derivative** of a function $f(x)\colon \mathbb{R}\to\mathbb{R}$ is a function $f'(x)$ is just what everyone knows, that is the limit of $$f'(x) = \lim_{h\to 0} \frac{f(x+h) - f(x)}{h}$$When you have a function of several variables, say $f(x_1, x_2, \dots, x_n)$, whose values are still real numbers, then you can still behave as if nothing happened, because if you fix the $n-1$ variables$$\bar{x}_{\scriptscriptstyle-i} \coloneqq (\bar{x}_1,\ldots,\bar{x}_{\scriptscriptstyle i-1},x_{\scriptscriptstyle i+1},\dots\bar{x}_n)$$Then the **univariate function** $g(x_i)$ is still a function of a single variable, and you can still take its derivative:$$ g\colon x \mapsto f(\bar{x}_{\scriptscriptstyle-i}, x) \coloneqq f(\bar{x}_1,\ldots,\bar{x}_{\scriptscriptstyle i-1},x_i, x_{\scriptscriptstyle i+1},\dots\bar{x}_n)$$We have a bunch of fancy notation for this, namely $\partial_i f(\bar{x})$ for short, or $\frac{\partial f}{\partial x_i}(\bar{x})$.We also sometimes speak of the **directional derivative** of $f$ in the direction of $\bar{x}$, which is often denoted $D_{\bar{x}}f$.What is not clear though, is how we relate each derivative to the others. How does $\partial_i f(\bar{x})$ relate to $\partial_j f(\bar{x})$?There is a good theorem to explain this !### The JacobianThe theorem that defines and cements the derivatives is the following:> Let $f\colon \mathbb{R}^n \to \mathbb{R}^m$ be a function. Then, at each point $x$ where $f$ is differentiable, there is a **unique matrix** $M(x)$ so that:> $$ f(x+h) \approx f(x) + M(x)h +\varepsilon (h)$$> where $\varepsilon (h)$ is a small vector in $\mathbb{R}^m$ that goes to $0$ faster than $h$ does[^1].[^1]: in the sense of a norm: $\lvert x\rvert$, $\varepsilon(h)/\lvert x\rvert\to 0$ as $\lvert x\rvert\to 0$.### The ProofThe uniqueness is fairly easy to prove, because if there are two matrices $M_1$ and $M_2$ such that $f(x+h) \approx f(x) + M_1h$ and $f(x+h) \approx f(x) + M_2h$, then we can subtract the two equations,$$(M_1 - M_2)h = \varepsilon_1 (h) + \varepsilon_2 (h)$$divide by $\lvert{h}\rvert$ to get that the LHS tends to $M_1 - M_2$ as $h\to 0$, and the RHS tends to $0$ as $h\to 0$, so we get that $M_1 - M_2 = 0$.The existence is essentially the notion that we take matrices as rectangular arrays of numbers. In that sense, if the matrix $M(x)$ has coefficients $\left(\smash{\alpha_{ij}(x)}\right)_{i,j}$, then we can write, our equation is merely a statement that is 1-D (which we know how to deal with):\begin{aligned}f(x+h) &= f(x) + M(x)h +\varepsilon (h) &\\\iff f_j(x+h) &= f_j(x) + \sum_{i=1}^m \alpha_{ij}(x)h_j + \varepsilon_j (h) &\forall j\in\{1,\dots m\}\\\iff f_j(x_i+h_i) &= f_j(x_i) + \sum_{i=1}^m \alpha_{ij}(x_i)h_j + \varepsilon_j (h_i) &\forall j\in\{1,\dots m\},\,\forall i\in\{1,\dots n\}\\\end{aligned}This last set of equation is just a $m\times n$ grid of first order approximations for each $f_j$ at each $x_i$, and we can solve for the $\alpha_{ij}$ by just taking the limit as $h_i\to 0$ for each $i$[^2]. If you hadn't guessed before, each of these equations are looking very much like $$\varphi(x+h) \approx \varphi(x) + \varphi'(x)h + \varepsilon(h)$$ we know in 1D, and we can just take the limit as $h\to 0$ to get that $\varphi'(x) = \alpha_{ij}(x)$. What this proves is 3 things:1. The Jacobian is a matrix, and it is unique.2. The Jacobian is a linear operator, and it is continuous.3. The Jacobian matrix is populated by the partial derivatives of $f$, that is$$ \textbf{Jac}[f](x) = \left[\dfrac{\partial f_j}{\partial x_i}(x)\right]_{\substack{1\leq j \leq m\\1\leq j \leq n}}$$::: {.callout-caution}## Careful about the dimensions !The jacobian sends points from a $n$-dimensional space to a $m$-dimensional space, so the Jacobian matrix is a $m\times n$ matrix. This is why the partial derivatives are written in the order $\dfrac{\partial f_j}{\partial x_i}$, because the $j$th row of the matrix is the partial derivatives of $f_j$ with respect to the $x_i$'s. This is very counterintuitive to anyone who works with the gradient $\nabla f(x)$, which is a $n$-dimensional vector, and is the transpose of the Jacobian matrix.:::### A small example using pythonLet's take a look at a small example I made using python's `sympy` library. Through [graphing increasingly complex 2D functions on an online tool](https://www.math3d.org/), I found a function that is complex enough for us to have fun, but not too complex that we can't understand it.The function is:$$G(x,y) = \log\left[1+\left(x^{\frac{1}{3}} - y^{\frac{2}{3}}\right)^2\left(0.1+\sin \left(x\right)^2\cdot \cos \left(x\right)^2\right)^{-1}\right]$$But I think it's easier to understand if we write it as:$$ G(x,y) = \log\left(1 + \dfrac{A(x,y)^2}{\gamma + B(x,y)^2}\right)$$with $A(x,y) = x^{\frac{1}{3}} - y^{\frac{2}{3}}$ and $B(x,y) = \sin \left(x\right)\cdot \cos \left(x\right)$.We can plot this function using `plotly`:```{python}# | fig-cap: A 3D plot of the function $G(x,y) = \log\left(1 + \dfrac{A(x,y)^2}{\gamma + B(x,y)^2}\right)$import plotly.graph_objects as goimport requests as rqimport jsonfrom pathlib import Pathtemplate = json.loads( Path("../plotlyTemplate.json").read_text())data = rq.get("https://storage.googleapis.com/open.data.arnov.dev/static/plots/optimization/CostFunction2D.json")fig = go.FigureWidget(data=data.json())fig.update_layout(**template,)fig.show()```[^2]: There is a subtelty here, because we can't really just let any $h_i$ to zero before the others arbitrarily, but if we do it in any order according to most used norms (here we'll use the euclidean norm $\lvert \cdot \rvert$), then we get that the limit exists and is unique.)